ABSTRACT

There has been substantial progress with formal models for 
sequential decision making by individual agents using the 
Markov decision process (MDP). However, similar treatment 
of multi-agent systems is lacking. A recent complexity
result, showing that solving decentralized MDPs is NEXP-hard,
provides a partial explanation. To overcome this complexity
barrier, we identify a general class of transition-independent
decentralized MDPs that is widely applicable.
The class consists of independent collaborating agents that
are tied together by a global reward function that depends on both
of their histories. We present a novel algorithm for solving
this class of problems and examine its properties. The result 
is the first effective technique to solve optimally a class of 
decentralized MDPs. This lays the foundation for further 
work in this area on both exact and approximate solutions. 

1. INTRODUCTION 

There has been a growing interest in recent years in formal 
models to control collaborative multi-agent systems. Some 
of these efforts have focused on extensions of the Markov de- 
cision process (MDP) to multiple agents, following substan- 
tial progress with the application of such models to prob- 
lems involving single agents. Examples of these attempts 
include the multi-agent Markov decision process (MMDP) 
proposed by Boutilier [2], the Communicative Multiagent 
Team Decision Problem (COM-MTDP) proposed by Pyna- 
dath and Tambe [11], and the decentralized Markov decision 
process (DEC-POMDP and DEC-MDP) proposed by Bern- 
stein, Givan, Immerman and Zilberstein [1]. The MMDP 
model is based on full observability of the global state by 
each agent. Similar assumptions have been made in recent 
work on multi-agent reinforcement learning [3]. We are
interested in situations in which each agent might have a
different partial view of the global state. This is captured by
the DEC-POMDP model, in which a single global Markov
process is controlled by multiple agents, each of which re- 
ceives partial information about the global state after each 
action is taken. A DEC-MDP is a DEC-POMDP with the 
restriction that at each time step the agents’ observations 
together uniquely determine the global state. 

A recent complexity study of decentralized control shows 
that solving such problems is extremely difficult. The com- 
plexity of both DEC-POMDP and DEC-MDP is NEXP- 
hard, even when only two agents are involved [1]. This is in 
contrast to the best known bounds for MDPs (P-hard) and 
POMDPs (PSPACE-hard) [9, 6]. The few recent studies of 
decentralized control problems (with or without communi- 
cation between the agents) confirm that solving even simple 
problem instances is extremely hard [11, 14]. 

One way to overcome this complexity barrier is to exploit 
the structure of the domain offered by some special classes of 
DEC-MDPs. We study one such class of problems, in which 
two agents operate independently but are tied together by 
some reward structure that depends on both of their exe- 
cution histories. The model is motivated by the problem 
of controlling the operation of multiple space exploration 
rovers, such as the ones used by NASA to explore the sur- 
face of Mars [13]. Periodically, such rovers are in communi- 
cation with a ground control center. During that time, the 
rovers transmit the scientific data they have collected and
receive a new mission for the next period. The mission in- 
volves visiting several sites at which the rovers could take 
pictures, conduct experiments, and collect data. Each rover 
must operate autonomously (including no communication 
between them) until communication with the control center 
is feasible. Because the rovers have limited resources (com- 
puting power, electricity, memory, and time) and because 
there is uncertainty about the amount of resources that are 
consumed at each site, a rover may not be able to complete 
its mission entirely. In previous work, we have shown how to 
model and solve the single rover control problem by creating 
a corresponding MDP [15]. 

When two rovers are deployed, each with its own mis- 
sion, there is important interaction between the activities 
they perform. Two activities may be complementary (e.g., 
taking pictures of the two sides of a rock), or they may 
be redundant (e.g., taking two spectrometer readings of the 
same rock). Both complementary and redundant activities 
present a problem: the global utility function is no longer 
additive over the two agents. When experiments provide 
redundant information, there is little additional value to
completing both, so the global value is subadditive. When
experiments are complementary, completing just one may 
have little value so the global value is superadditive. The 
problem is to find a pair of policies, one for each rover, that 
maximizes the value of the information received by ground 
control. 

We develop in this paper a general framework to handle 
such problems. Pynadath and Tambe [11] studied the
performance of some existing teamwork theories from a
decision-theoretic perspective, similar to the one we develop here.
However, our technique exploits the structure of the prob- 
lem to find the optimal solution more efficiently. 

A simplified version of our problem has been studied by 
Ooi and Wornell [7], under the assumption that all agents
share state information every K time steps. A dynamic pro- 
gramming algorithm has been developed to derive optimal 
policies for this case. A downside of this approach is that the 
state space for the dynamic programming algorithm grows 
doubly exponentially with K. The only known tractable
algorithms for these types of problems rely on even more
assumptions. One such algorithm was developed by Hsu and
Marcus [5] and works under the assumption that the agents
share state information every time step (although it can take
one time step for the information to propagate). Approximation
algorithms have also been developed for these problems,
although they can at best give guarantees of local
optimality. For instance, Peshkin et al. [10] studied algorithms
that perform gradient descent in a space of parameterized 
policies. 

The rest of the paper is organized as follows. Section 2 
provides a formal description of the class of problems we 
consider. Section 3 presents the coverage set algorithm, 
which optimally solves this class of problems, and proves 
that the algorithm is both complete and optimal. Section 
4 illustrates how the algorithm performs on a simple sce- 
nario modeled after the planetary rover. We conclude with 
a summary of the contributions of this work. 

2. FORMAL PROBLEM DESCRIPTION 

In this section we formalize the rovers control problem 
as a transition-independent, cooperative, decentralized de- 
cision problem. The domain involves two agents operating 
in a decentralized manner, choosing actions based upon their 
own local and incomplete view of the world. The agents are 
cooperative in the sense that there is one value function for 
the system as a whole that is being maximized.*

Definition 1. A factored, 2-agent DEC-MDP is defined
by a tuple $\langle S, A, P, R \rangle$, where

• $S = S_1 \times S_2$ is a finite set of states, with a distinguished
initial state $(s_0^1, s_0^2)$. $S_i$ indicates the set of states of
agent i.

• $A = A_1 \times A_2$ is a finite set of actions. $A_i$ indicates the
set of actions of agent i.

• P is a transition function. $P((s_1', s_2') | (s_1, s_2), (a_1, a_2))$
is the probability of the outcome state $(s_1', s_2')$ when the
action pair $(a_1, a_2)$ is taken in state $(s_1, s_2)$.

• R is a reward function. $R((s_1, s_2), (a_1, a_2), (s_1', s_2'))$ is
the reward obtained from taking actions $(a_1, a_2)$ from
state $(s_1, s_2)$ and transitioning to state $(s_1', s_2')$.

*In this paper, we assume that the agents cannot communicate
between execution episodes, so they need to derive
an optimal joint policy that involves no exchange of information.
We are also studying decentralized MDPs in which
the agents can communicate at some cost [4].

We assume that agent i observes $s_i$, hence the two agents
have joint observability. We call components of the factored
DEC-MDP that apply to just one agent local; for example,
$s_i$ and $a_i$ are local states and local actions for agent i. A
policy for each agent, denoted by $\pi_i$, is a mapping from local
states to local actions. $\pi_i(s_i)$ denotes the action taken by
agent i in state $s_i$. A joint policy is a pair of policies, one
for each agent.

Definition 2. A factored, 2-agent DEC-MDP is said to
be transition independent if there exist $P_1$ and $P_2$ such
that

\[ P((s_1', s_2') | (s_1, s_2), (a_1, a_2)) = P_1(s_1' | s_1, a_1) \cdot P_2(s_2' | s_2, a_2). \]

That is, the new local state of each agent depends only on 
the previous local state and the action taken by that agent. 

Definition 3. A factored, 2-agent DEC-MDP is said to 
be reward independent if there exist $R_1$ and $R_2$ such that

\[ R((s_1, s_2), (a_1, a_2), (s_1', s_2')) = R_1(s_1, a_1, s_1') + R_2(s_2, a_2, s_2'). \]

That is, the overall reward is composed of the sum of two 
local reward functions, each of which depends only on the 
local state and action of one of the agents. 

Obviously, if a DEC-MDP is both transition independent 
and reward independent, then it can be formulated and 
solved as two separate MDPs. However, if it satisfies just 
one of the independence properties it remains a non-trivial 
class of DEC-MDPs. We are interested in the general class 
that exhibits transition independence without reward inde- 
pendence. The Mars rovers application is an example of 
such a domain. The local states capture the positions of 
the rovers and the available resources. The actions involve 
various data collecting actions at the current site or the de- 
cision to skip to the next site. The overall reward, however, 
depends on the value of the data collected by the two rovers 
in a non-additive way. 
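As a minimal illustration of Definition 2, the sketch below builds a joint transition distribution as the product of two local ones. It is not taken from the paper: the dictionaries `P1` and `P2`, the state and action labels, and the helper `joint_transition` are hypothetical names chosen for the example.

```python
from itertools import product

# Hypothetical local transition models: P_i[(s, a)] -> {s_next: probability}.
P1 = {("a0", "go"): {"a1": 0.8, "a0": 0.2}}
P2 = {("b0", "go"): {"b1": 0.5, "b0": 0.5}}


def joint_transition(P1, P2, s, a):
    """Joint outcome distribution as the product of the local ones (Definition 2)."""
    (s1, s2), (a1, a2) = s, a
    return {
        (n1, n2): p1 * p2
        for (n1, p1), (n2, p2) in product(P1[(s1, a1)].items(),
                                          P2[(s2, a2)].items())
    }


joint = joint_transition(P1, P2, ("a0", "b0"), ("go", "go"))
```

Because each agent's next local state depends only on its own local state and action, the joint model never needs to be stored explicitly; it can always be recovered from the two local models in this way.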

2.1 Joint Reward Structures 

We now introduce further structure into the global reward 
function. To define it, we need to introduce the notion of an 
occurrence of an event during the execution of a local policy. 

Definition 4. A history, $\Phi = [s_0, a_0, s_1, a_1, \ldots]$, is a
sequence that records all the local states and actions for one of
the agents, starting with the local initial state of that agent. 

Definition 5. A primitive event, $e = (s, a, s')$, is a
triplet that includes a state, an action, and an outcome state.
An event $E = \{e_1, e_2, \ldots, e_m\}$ is a set of primitive events.

Definition 6. A primitive event $e = (s, a, s')$ occurs in
history $\Phi$, denoted $\Phi \models e$, if the triplet $(s, a, s')$ appears as a
subsequence of $\Phi$. An event $E = \{e_1, e_2, \ldots, e_m\}$ occurs in
history $\Phi$, denoted $\Phi \models E$, iff

\[ \exists e \in E : \Phi \models e. \]

Events are used to capture the fact that an agent accomplished
some task. In some cases a single local state may be
sufficient to signify the completion of a task. But because of
the uncertainty in the domain and because tasks could be 
accomplished in many different ways, we generally need a 
set of primitive events to capture the completion of a task. 
To illustrate this in the rover example, we will define the 
state to be $\langle t, l \rangle$, where t is the time left in the day and
l is the location of the current data collection site. The
actions are skip and collect. The event "took picture of site 4"
would be described by the following set of primitive events:

\[ \{ (\langle t, l \rangle, a, \langle t', l' \rangle) \mid \langle t, l \rangle = \langle *, 4 \rangle,\; a = \text{collect},\; \langle t', l' \rangle = \langle *, 5 \rangle \}. \]
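The notions of history, primitive event and event occurrence can be sketched in a few lines. The code below is an illustration only: the flat-list encoding of a history, the concrete rover history, and the function names are assumptions made for this example, and "subsequence" is read as three consecutive entries of the history.

```python
def primitive_occurs(history, e):
    """True if the triple e = (s, a, s') appears consecutively in history.

    history is a flat list [s0, a0, s1, a1, ...] in the style of Definition 4.
    """
    # Triples can only start at even indices, which hold states.
    return any(tuple(history[i:i + 3]) == e
               for i in range(0, len(history) - 2, 2))


def event_occurs(history, E):
    """An event occurs if at least one of its primitive events occurs."""
    return any(primitive_occurs(history, e) for e in E)


# Hypothetical rover history: states are (time_left, site) pairs.
history = [(10, 3), "skip", (9, 4), "collect", (6, 5), "skip", (5, 6)]

# "Took picture of site 4": collect at site 4 and arrive at site 5, for any times.
took_picture_4 = {((t, 4), "collect", (t2, 5))
                  for t in range(11) for t2 in range(11)}
```

Enumerating every time value plays the role of the wildcard `*` in the set definition.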

Definition 7. A primitive event e is said to be proper if
it can occur at most once in any possible history of a given
MDP. That is:

\[ \forall \Phi = \Phi_1 e \Phi_2 : \neg(\Phi_1 \models e) \wedge \neg(\Phi_2 \models e) \]

Definition 8. An event $E = \{e_1, e_2, \ldots, e_n\}$ is said to be
proper if it consists of mutually exclusive proper primitive
events with respect to some given MDP. That is:

\[ \forall \Phi\; \neg\exists\, i \ne j : (e_i \in E \wedge e_j \in E \wedge \Phi \models e_i \wedge \Phi \models e_j) \]

We limit the discussion in this paper to proper events. 
They are sufficiently expressive for the rover domain and 
for the other applications we consider, while simplifying the 
discussion. Later we show how some non-proper events can 
be modeled in this framework. 

Definition 9. Let $\Phi_1$ and $\Phi_2$ be histories of two
transition-independent MDPs. A joint reward structure

\[ \rho = [(E_1^1, E_1^2, c_1), \ldots, (E_n^1, E_n^2, c_n)] \]

specifies the reward (or penalty) $c_k$ that is added to the global
value function if $\Phi_1 \models E_k^1$ and $\Phi_2 \models E_k^2$.

We limit the discussion in this paper to factored DEC- 
MDPs that are transition-independent, and whose global 
reward function R is composed of two local reward functions,
$R_1$ and $R_2$, one for each agent, plus a joint reward structure
$\rho$. This allows us to refer to the underlying MDP,
$\langle S_i, A_i, P_i, R_i \rangle$, even though the problem is not reward
independent.

Given a local policy, $\pi$, the probability that a primitive
event $e = (s, a, s')$ will occur during any execution of $\pi$,
denoted $P(e|\pi)$, can be expressed as follows:

\[ P(e|\pi) = \sum_{t=0}^{\infty} P_t(s|\pi)\, P(a|s,\pi)\, P(s'|s,a) \qquad (1) \]

where $P_t(s|\pi)$ is the probability of being in state s at time
step t. $P_t(s|\pi)$ can be easily computed for a given MDP
from its transition model; $P(a|s,\pi)$ is simply 1 if $\pi(s) = a$
and 0 otherwise; and $P(s'|s,a)$ is simply the transition
probability. Similarly, the probability that a proper event
$E = \{e_1, e_2, \ldots, e_m\}$ will occur is:

\[ P(E|\pi) = \sum_{e \in E} P(e|\pi). \qquad (2) \]

We can sum the probabilities over the primitive events in E 
because they are mutually exclusive. 
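Equations (1) and (2) can be evaluated by pushing the state distribution forward one step at a time. The sketch below is a hedged illustration rather than the authors' implementation: it assumes a deterministic policy given as a dictionary, treats states without a policy entry as terminal, and truncates the infinite sum at a caller-supplied `horizon`. The toy model and all names are hypothetical.

```python
def event_probability(P, policy, s0, E, horizon):
    """Probability that proper event E occurs, per Equations (1) and (2).

    P[(s, a)] maps to {s_next: prob}; policy maps state -> action.
    States without an entry in policy are treated as terminal.
    """
    dist = {s0: 1.0}          # P_t(s | pi) for the current time step t
    total = 0.0
    for _ in range(horizon):
        nxt = {}
        for s, p_s in dist.items():
            if s not in policy:
                continue
            a = policy[s]     # P(a | s, pi) is 1 for this action only
            for s2, p in P[(s, a)].items():
                if (s, a, s2) in E:
                    total += p_s * p
                nxt[s2] = nxt.get(s2, 0.0) + p_s * p
        dist = nxt
    return total


# Hypothetical toy model: one collection attempt and one retry.
P = {("s0", "collect"): {"done": 0.7, "fail": 0.3},
     ("fail", "retry"): {"done": 0.5, "dead": 0.5}}
policy = {"s0": "collect", "fail": "retry"}
E = {("s0", "collect", "done"), ("fail", "retry", "done")}
p_E = event_probability(P, policy, "s0", E, horizon=5)
```

In this toy model the two primitive events are mutually exclusive, so their probabilities add: 0.7 for immediate success plus 0.3 x 0.5 for success on the retry.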


Definition 10. Given a joint policy $(\pi_1, \pi_2)$ and a joint
reward structure $\rho$, the joint value is:

\[ JV(\rho|\pi_1, \pi_2) = \sum_{i=1}^{n} P(E_i^1|\pi_1)\, P(E_i^2|\pi_2)\, c_i. \]

Definition 11. A global value function of a transition-independent
decentralized MDP with respect to a joint policy
$(\pi_1, \pi_2)$, local reward functions $R_1, R_2$ and a joint reward
structure $\rho$ is:

\[ GV(\pi_1, \pi_2) = V_{\pi_1}(s_0^1) + V_{\pi_2}(s_0^2) + JV(\rho|\pi_1, \pi_2) \]

where $V_{\pi_i}(s_0^i)$ is the standard value of the underlying MDP
for agent i at $s_0^i$ given policy $\pi_i$.

While $V_\pi(s)$ is generally interpreted to be the value of
state s given policy $\pi$, we will sometimes refer to it as the
value of $\pi$ because we are only interested in the value of the
initial state given $\pi$.
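The joint value and the global value are simple to compute once the event probabilities of each policy are known. The following sketch uses hypothetical numbers and function names; it assumes each policy has already been summarized by its local value and by a vector of event probabilities.

```python
def joint_value(p1, p2, c):
    """JV: sum over k of P(E_k^1 | pi_1) * P(E_k^2 | pi_2) * c_k."""
    return sum(a * b * ck for a, b, ck in zip(p1, p2, c))


def global_value(v1, v2, p1, p2, c):
    """GV: the two local values plus the joint value."""
    return v1 + v2 + joint_value(p1, p2, c)


# Hypothetical numbers: a bonus of 10 for a complementary pair of tasks
# and a penalty of 4 for a redundant pair.
p1, p2, c = [0.5, 1.0], [0.8, 0.25], [10.0, -4.0]
jv = joint_value(p1, p2, c)
gv = global_value(2.0, 1.5, p1, p2, c)
```

A positive $c_k$ models complementary (superadditive) activities, and a negative $c_k$ models redundant (subadditive) ones.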

The goal is to find a joint policy that maximizes the global 
value function. 

Definition 12. An optimal joint policy, denoted
$(\pi_1, \pi_2)^*$, is a pair of policies that maximize the global value
function, that is:

\[ (\pi_1, \pi_2)^* = \arg\max_{\pi_1, \pi_2} GV(\pi_1, \pi_2). \]

2.2 Expressiveness of the Model 

Transition-independent DEC-MDPs with a joint reward
structure may seem to represent a small set of domains, but 
it turns out that the class of problems we address is quite 
general. Many problems that do not seem to be in this class 
can actually be represented by adding extra information to 
the global state. In particular, some problems that do not 
naturally adhere to the mutual exclusion among primitive 
events can be handled in this manner. The mutual exclusion 
property guarantees that at most one primitive event within
an event set can occur. We discuss below some cases violat- 
ing this assumption and how they can be treated within the 
framework we developed. 

At least one primitive event - Suppose that multiple 
primitive events within the set can occur and that an addi- 
tional reward is added when at least one of them does occur. 
In this case the state can be augmented with one bit per 
event that is initially 0. When a primitive event in the set 
occurs, the bit is set to 1. If we redefine each primitive event 
so that the corresponding bit switches from 0 to 1 when it 
occurs, we make the event set proper because the bit can 
switch from 0 to 1 only once in every possible history. 
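The bit-augmentation idea can be sketched as a transformation of a local transition model. This is an illustrative reading of the paragraph above, not code from the paper; the representation of the model as a dictionary and the name `augment_with_bit` are assumptions.

```python
def augment_with_bit(P, event):
    """Return a transition model over (state, bit) pairs plus a proper event.

    P[(s, a)] -> {s_next: prob}; event is a set of (s, a, s_next) triples.
    """
    P_aug, proper_event = {}, set()
    for (s, a), outcomes in P.items():
        for bit in (0, 1):
            dist = {}
            for s2, p in outcomes.items():
                hit = (s, a, s2) in event
                new_bit = 1 if (bit or hit) else 0
                dist[(s2, new_bit)] = p
                if bit == 0 and hit:
                    # Only the transition that flips the bit counts.
                    proper_event.add(((s, 0), a, (s2, 1)))
            P_aug[((s, bit), a)] = dist
    return P_aug, proper_event


# Hypothetical self-loop: the primitive event could repeat in the original model.
P = {("x", "ping"): {"x": 1.0}}
P_aug, proper = augment_with_bit(P, {("x", "ping", "x")})
```

After augmentation, the only rewarded transition is the one that moves the bit from 0 to 1, and the bit never returns to 0, so the redefined event can occur at most once in any history.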

All primitive events - Suppose that an event is com- 
posed of n primitive events, all of which must occur to trig- 
ger the extra reward. In this case, each local state must 
be augmented with n bits, one bit for each primitive event 
(ordering constraints among the primitive events could be 
exploited to reduce the number of bits necessary). The new 
set of events that describe the activity are those that flip 
the last bit from 0 to 1.

Counting occurrences - Suppose that an event is based 
on a primitive event (or another event set) repeating at least 
n times. Here, the local state can be augmented with log n 
bits to be used as a counter. The extra reward is triggered 
when the desired number of occurrences is reached, at which 
point the counting stops. 





Temporal constraints - So far we have focused on global 
reward structures that do not impose any temporal con- 
straints on the agents. Some temporal constraints can also 
be represented in this framework if time is enumerated as 
part of the local states. For example, suppose that event 
$E_1$ in agent 1 facilitates event $E_2$ in agent 2, that is, the
occurrence of $E_1$ before $E_2$ leads to some extra reward when
$E_2$ occurs. If we include time in local states, then we could
include in the joint reward structure a triplet for every possi- 
ble combination of primitive events that satisfy the temporal 
constraint. Obviously, this leads to creating large numbers 
of event sets. A more compact representation of temporal 
constraints remains the subject of future research. 

To summarize, there is a wide range of practical prob- 
lems that can be represented within our framework. Non- 
temporal constraints tend to have a more natural, compact 
representation, but some temporal constraints can also be 
captured. 

3. COVERAGE SET ALGORITHM 

In this section, we develop a general algorithm for solving 
transition-independent, factored, 2-agent DEC-MDPs with 
a global value function as given in Definition 11. The
algorithm returns the optimal joint policy.

So far, no optimal algorithm has been presented for solv- 
ing the general class of DEC-MDPs, short of complete enu- 
meration and evaluation of policies. Some search algorithms 
in policy space and gradient descent techniques have been
used to find approximate solutions, with no guarantee of
convergence in the limit to the optimal joint policy (e.g.,
[10]). Here we present a novel algorithm that utilizes the 
structure of the problem to find the optimal joint policy. 
(Throughout the discussion we use i to refer to one agent 
and j to refer to the other agent.) 

The algorithm is divided up into three major parts: 

1. Create augmented MDPs. An augmented MDP repre- 
sents one agent’s underlying MDP with an augmented 
reward function. 

2. Find the optimal coverage set for the augmented MDPs, 
which is the set of all optimal policies for one agent 
that correspond to any possible policy of the other 
agent. As we show below, this set can be represented 
compactly. 

3. Find for each policy in the optimal coverage set the 
corresponding best policy for the other agent. Return 
the best among this set of joint policies, which is the 
optimal joint policy. 

Pseudo-code for the coverage set algorithm is shown in 
Figure 1. The main function, CSA, takes a factored DEC- 
MDP and a joint reward structure as described in Section 
2, and returns an optimal joint policy. 

3.1 Creating Augmented MDPs 

The first step in the algorithm is to create the augmented 
MDPs, which are derived from the underlying MDPs for 
each agent with an augmented reward function. The new re- 
ward is calculated from the original reward, the joint reward 
structure and the policy of the other agent. The influence 
of the other agent’s policy on the augmented MDP can be 
captured by a vector of probabilities, which is a point in the 
following parameter space. 


function CSA(MDP1, MDP2, ρ)
  returns the optimal joint policy
  inputs: MDP1, underlying MDP for agent 1
          MDP2, underlying MDP for agent 2
          ρ, joint reward structure

  optset ← COVERAGE-SET(MDP1, ρ)
  value ← −∞
  jointpolicy ← {}

  /* find best joint optimal policy */
  for each policy1 in optset
    policy2 ← SOLVE(AUGMENT(MDP2, policy1, ρ))
    v ← GV({policy1, policy2}, MDP1, MDP2, ρ)
    if (v > value)
      then value ← v
           jointpolicy ← {policy1, policy2}
  return jointpolicy

function COVERAGE-SET(MDP, ρ)
  returns set of all optimal policies with respect to ρ
  inputs: MDP, arbitrary MDP
          ρ, joint reward structure

  planes ← {}  /* planes are equivalent to policies */
  points ← {}
  boundaries ← {}

  /* initialize boundaries of parameter space */
  for i ← 1 to |ρ|
    boundaries ← boundaries ∪ {x_i = 0, x_i = 1}

  /* loop until no new optimal policies found */
  do
    newplanes ← {}
    points ← INTERSECT(planes ∪ boundaries)

    /* get optimal plane at each point */
    for each point in points
      plane ← SOLVE(AUGMENT(MDP, point, ρ))
      if plane not in planes
        then newplanes ← newplanes ∪ {plane}
    planes ← planes ∪ newplanes
  while |newplanes| > 0
  return planes


Figure 1: Coverage Set Algorithm 

Definition 13. The parameter space is an n-dimensional
space where each dimension has a range of [0, 1]. Each
policy $\pi_j$ has a corresponding point in the parameter space,
$\bar{x}_{\pi_j}$, which measures the probabilities that each one of the n
events will occur when agent j follows policy $\pi_j$:

\[ \bar{x}_{\pi_j} = [P(E_1^j|\pi_j), P(E_2^j|\pi_j), \ldots, P(E_n^j|\pi_j)]. \]

Given a point in the parameter space, $\bar{x}_{\pi_j}$, agent i can
define a decision problem that accurately represents the global
value instead of its local value. It can do this because both 
the joint reward structure and agent j’s policy are fixed. The 
new decision problem is defined by an augmented MDP. 

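Definition 14. An augmented MDP, denoted $MDP_i^{\bar{x}_{\pi_j}}$,
is defined as $\langle S_i, A_i, P_i, R_i', s_0^i, \bar{x}_{\pi_j}, \rho \rangle$, where $\bar{x}_{\pi_j}$ is a
point in parameter space computed from the policy for agent
j, $\rho$ is the joint reward structure and $R_i'$ is:

\[ R_i'(e) = R_i(e) + \sum_{k=1}^{|\rho|} \delta_k\, P(E_k^j|\pi_j)\, c_k, \qquad \text{where } \delta_k = \begin{cases} 1 & \text{if } e \in E_k^i \\ 0 & \text{otherwise.} \end{cases} \]
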

Note that for e = (s, a, s'), R(e) is the same as R(s, a, s'). 
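A minimal sketch of the augmented reward of Definition 14 follows. It assumes the local reward is available as a Python function of a primitive event, that agent i's event sets are given as a list of sets, and that the other agent's policy has already been reduced to its point in parameter space; all names and numbers are hypothetical.

```python
def augmented_reward(R_i, events_i, x_j, c):
    """Return R'_i as a function of a primitive event e = (s, a, s').

    events_i[k] is agent i's event set in the k-th joint reward triple,
    x_j[k] is the probability that the other agent completes its side,
    and c[k] is the joint reward of that triple.
    """
    def R_prime(e):
        bonus = sum(x_j[k] * c[k]
                    for k, E_k in enumerate(events_i) if e in E_k)
        return R_i(e) + bonus
    return R_prime


# Hypothetical example with a constant local reward of 1.
events_i = [{("s", "collect", "t")}, {("s", "skip", "t")}]
R_prime = augmented_reward(lambda e: 1.0, events_i, x_j=[0.5, 0.2], c=[10.0, -5.0])
```

Any standard MDP solver can then be run on the underlying model with `R_prime` in place of the local reward, which is what makes the other agent's influence a fixed vector of numbers.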

Theorem 1. The value of a policy $\pi_i$ over $MDP_i^{\bar{x}_{\pi_j}}$ is:

\[ V^{\bar{x}_{\pi_j}}_{\pi_i}(s_0^i) = V_{\pi_i}(s_0^i) + JV(\rho|\pi_i, \pi_j). \]

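Proof. The value of an MDP given a policy can be calculated
by summing over all time steps t and all events e,
the probability of seeing e after exactly t steps, times the
reward gained from e:

\[
\begin{aligned}
V^{\bar{x}_{\pi_j}}_{\pi_i}(s_0^i)
 &= \sum_{t=0}^{\infty} \sum_{e} P_t(e|\pi_i)\, R_i'(e) \\
 &= \sum_{t=0}^{\infty} \sum_{e} P_t(e|\pi_i) \Big( R_i(e) + \sum_{k=1}^{|\rho|} \delta_k\, P(E_k^j|\pi_j)\, c_k \Big) \\
 &= \sum_{t=0}^{\infty} \sum_{e} P_t(e|\pi_i)\, R_i(e) + \sum_{t=0}^{\infty} \sum_{e} \sum_{k=1}^{|\rho|} \delta_k\, P_t(e|\pi_i)\, P(E_k^j|\pi_j)\, c_k \\
 &= V_{\pi_i}(s_0^i) + \sum_{k=1}^{|\rho|} P(E_k^j|\pi_j)\, c_k \sum_{t=0}^{\infty} \sum_{e \in E_k^i} P_t(e|\pi_i) \\
 &= V_{\pi_i}(s_0^i) + \sum_{k=1}^{|\rho|} P(E_k^i|\pi_i)\, P(E_k^j|\pi_j)\, c_k \\
 &= V_{\pi_i}(s_0^i) + JV(\rho|\pi_i, \pi_j).
\end{aligned}
\]
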

The function AUGMENT in Figure 1 takes an MDP, a 
policy and a joint reward structure and returns an aug- 
mented MDP according to Definition 14. 

The global value is equal to the local value of a policy for 
one agent plus the value of the corresponding augmented 
MDP for the other agent. This allows an augmented MDP 
to be used in an iterative, hill-climbing algorithm. In such 
an algorithm, the policy for agent j, $\pi_j$, is held fixed and the
optimal corresponding policy for agent i is found by finding
the optimal policy for $MDP_i^{\bar{x}_{\pi_j}}$. Then the roles are reversed,
and agent i's new policy is held fixed while the optimal
corresponding policy for agent j is found. This process repeats
until neither policy changes, i.e., until a Nash equilibrium is
reached.


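Proposition 1. An optimal joint policy $(\pi_1, \pi_2)^*$ is a
Nash equilibrium over the augmented MDPs:

\[ V^{\bar{x}_{\pi_2}}_{\pi_1}(s_0^1) = \max_{\pi_1'} V^{\bar{x}_{\pi_2}}_{\pi_1'}(s_0^1), \]
\[ V^{\bar{x}_{\pi_1}}_{\pi_2}(s_0^2) = \max_{\pi_2'} V^{\bar{x}_{\pi_1}}_{\pi_2'}(s_0^2). \]
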

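Proof.
Assume $\exists \pi_1' \ne \pi_1 : V^{\bar{x}_{\pi_2}}_{\pi_1'}(s_0^1) > V^{\bar{x}_{\pi_2}}_{\pi_1}(s_0^1)$. From Definition
11 and Theorem 1:

\[ GV(\pi_1', \pi_2) = V^{\bar{x}_{\pi_2}}_{\pi_1'}(s_0^1) + V_{\pi_2}(s_0^2) \]
\[ GV(\pi_1, \pi_2) = V^{\bar{x}_{\pi_2}}_{\pi_1}(s_0^1) + V_{\pi_2}(s_0^2) \]

Therefore, $GV(\pi_1', \pi_2) > GV(\pi_1, \pi_2)$.
This contradicts the optimality of $(\pi_1, \pi_2)^*$. By symmetry, we can show the
same for $\pi_2$. Therefore, the optimal joint policy is a Nash
equilibrium over augmented MDPs. □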

While the hill-climbing algorithm reaches a Nash equilib- 
rium, and the optimal joint policy is a Nash equilibrium, 
there is no guarantee that they are the same equilibrium. 
In many problems there are many local optima, with the 
optimal joint policy being the best of them. However, the 
hill-climbing algorithm, even with random restarts, does not 
guarantee finding the global maximum.
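The gap between a Nash equilibrium and the optimal joint policy can be seen on a very small hypothetical instance. In the sketch below each policy is summarized by a pair (local value, event-probability vector), and solving an augmented MDP is replaced by enumerating these summaries; both simplifications, and all names, are assumptions made for illustration.

```python
def aug_value(pol, other, c):
    """Value of `pol` in the augmented MDP induced by `other` (Theorem 1)."""
    v, p = pol
    _, q = other
    return v + sum(pk * qk * ck for pk, qk, ck in zip(p, q, c))


def best_response(policies, other, c):
    """Index of the policy that is optimal against `other`."""
    return max(range(len(policies)),
               key=lambda i: aug_value(policies[i], other, c))


def hill_climb(pols1, pols2, c, start1=0, start2=0):
    """Alternate best responses until neither policy changes."""
    i, j = start1, start2
    while True:
        new_i = best_response(pols1, pols2[j], c)
        new_j = best_response(pols2, pols1[new_i], c)
        if (new_i, new_j) == (i, j):
            return i, j
        i, j = new_i, new_j


# Policy 0 earns 3 locally and ignores the joint task;
# policy 1 earns 0 locally but always completes its half of it.
pols = [(3.0, [0.0]), (0.0, [1.0])]
c = [10.0]
stuck = hill_climb(pols, pols, c, 0, 0)
good = hill_climb(pols, pols, c, 1, 1)
```

Started from (0, 0), the procedure stops immediately at a joint policy with global value 6, although (1, 1) has global value 10: neither agent gains by completing its half of the task alone.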

3.2 Finding the Optimal Coverage Set 

An augmented MDP is defined over the parameter space, 
which is a continuous space. This means that for both 
agents, there is an infinite number of augmented MDPs. 
However, only a finite number of them are potentially useful:
the ones where the point in parameter space corresponds
to a policy of the other agent. This set is still quite large 
since the number of policies is exponential in the number 
of states. It turns out that in practice, only a very small 
number of these augmented MDPs are really interesting. 

Given any particular augmented MDP, the only policy 
that is useful is the policy that maximizes the expected 
value of that MDP. Fortunately, most of the augmented 
MDPs have the same optimal policy, and the set of interesting
augmented MDPs can be reduced to one per optimal
policy. This set of optimal policies, the optimal coverage 
set, is what we are really after and the augmented MDPs 
provide a way to find them. 

Definition 15. The optimal coverage set, $O_i$, is the
set of optimal policies for $MDP_i^{\bar{x}}$ given any point in parameter
space, $\bar{x}$:

\[ O_i = \{ \pi_i \mid \exists \bar{x},\; \pi_i = \arg\max_{\pi_i'} V^{\bar{x}}_{\pi_i'}(s_0^i) \}. \]

Another way to look at the optimal coverage set is to 
examine the geometric representation of a policy over the 
parameter space. The value of a policy $\pi_i$, given in Theorem 1,
is a linear equation. If $|\bar{x}_{\pi_j}| = 1$, then the value
function is a line in two dimensions. When $|\bar{x}_{\pi_j}| = 2$, the
value function is a plane in three dimensions. In general,
$|\bar{x}_{\pi_j}| = n$ and the value function is a hyperplane in n + 1
dimensions.
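To make the linearity explicit, Theorem 1 can be expanded with the definition of the joint value. Writing $x_k$ for the k-th coordinate of a point $\bar{x}$ in parameter space (a symbol introduced here only for this restatement), the value of a fixed policy $\pi_i$ is

\[ V^{\bar{x}}_{\pi_i}(s_0^i) = V_{\pi_i}(s_0^i) + \sum_{k=1}^{n} \big( P(E_k^i|\pi_i)\, c_k \big)\, x_k , \]

so the intercept of the hyperplane is the local value of $\pi_i$ and its slope along dimension k is $P(E_k^i|\pi_i)\, c_k$.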

The optimal coverage set, then, is the set of hyperplanes 
that are highest in the (n + 1)th dimension for all points in
the parameter space (first n dimensions). Figure 2 gives an 
example of planes over a 2-dimensional parameter space. 


Theorem 2. If two points x and y in n-dimensional parameter
space have the same corresponding optimal policy $\pi$,
then all points on $f(\alpha) = x + \alpha(y - x)$, $0 \le \alpha \le 1$, the line
segment between x and y, have optimal policy $\pi$.

Proof.
Let $\pi = \arg\max_{\pi''} V^{x}_{\pi''}(s_0) = \arg\max_{\pi''} V^{y}_{\pi''}(s_0)$,
$z = f(\alpha_0)$, $0 < \alpha_0 < 1$, and
$\pi' = \arg\max_{\pi''} V^{z}_{\pi''}(s_0)$.

Assume $V^{z}_{\pi}(s_0) < V^{z}_{\pi'}(s_0)$. We know $V^{x}_{\pi}(s_0) \ge V^{x}_{\pi'}(s_0)$, and
because $V(\cdot)$ and $f(\cdot)$ are linear functions, we can compute
their value at $f(1) = y$ by computing the unit slope.

\[ V^{y}_{\pi}(s_0) = \frac{V^{z}_{\pi}(s_0) - V^{x}_{\pi}(s_0)}{\alpha_0} + V^{x}_{\pi}(s_0) \]

\[ V^{y}_{\pi'}(s_0) = \frac{V^{z}_{\pi'}(s_0) - V^{x}_{\pi'}(s_0)}{\alpha_0} + V^{x}_{\pi'}(s_0) \]

\[ V^{y}_{\pi}(s_0) < V^{y}_{\pi'}(s_0) \]

This contradicts that $\pi$ is optimal at y, therefore
$V^{z}_{\pi}(s_0) = V^{z}_{\pi'}(s_0)$.

□

Figure 2: Intersecting Planes. (a) The first iteration checks the corners of the parameter space: (0,0), (0,1),
(1,0), (1,1), which yields three planes. In the second iteration four new interesting points are found. The top
three all have the same optimal plane, which is added in (b). The fourth point yields the plane added in (c).
The next iteration produces eight new interesting points, two of which result in the sixth plane, added in
(d). The next iteration finds no new optimal planes and terminates, returning the set of six policies.


A bounded polyhedron in n dimensions is composed of a 
set of faces, which are bounded polyhedra in n - 1 dimen- 
sions. The corners of a bounded polyhedron are the points 
(polyhedra in 0 dimensions) that the polyhedron recursively 
reduces to. 


Theorem 3. Given a bounded polyhedron in n dimensions
whose corners all have the same corresponding optimal
policy $\pi$, any point on the surface or in the interior of that
polyhedron also has optimal policy $\pi$.

Proof. By induction on the number of dimensions.
Base case: n = 1. A polyhedron in 1 dimension is a line
segment with corners being the endpoints. From Theorem
2, all points on the line have optimal policy $\pi$.

Inductive case: Assume true for n - 1, show true for n.
From the inductive assumption, all points in the faces of the
polyhedron have optimal policy $\pi$. For any point within the
polyhedron, any line passing through that point intersects
two faces, with the interior point between the intersection
points. From Theorem 2, this point has optimal policy
$\pi$. □

The part of the algorithm discussed in this section is han- 
dled by the function COVERAGE-SET in Figure 1. It takes 
an MDP and a joint reward structure and returns the opti- 
mal coverage set, based on Theorem 3. To illustrate how 
this works, we will step through a small example. 

Consider an instance of the Mars rover problem, with just 
two elements in the joint reward structure: $(E_1^1, E_1^2, c_1)$ and
$(E_2^1, E_2^2, c_2)$. The function CSA starts by calling COVERAGE-SET
on $MDP_1$ and $\rho$. The first thing that COVERAGE-SET
does is to create the boundaries of the parameter space.
These are the hyperplanes that enclose the parameter space.
Since each dimension is a probability, it can range from 0
to 1, so in this case there are 4 boundary lines: $x_1 = 0$,
$x_1 = 1$, $x_2 = 0$, $x_2 = 1$. The algorithm then loops until no
new planes are found.

In each loop, INTERSECT is called on the set of known 
policy hyperplanes and boundary hyperplanes. INTERSECT 
takes a set of hyperplanes and returns a set of points that 
represent the intersections of those hyperplanes. Linear
programming would be an efficient way to compute these
intersections because the majority of them are not interesting:
they lie outside the parameter space or lie below another 
plane. For example, Figure 2(d) has six policy planes and 
the four boundaries of the parameter space. The total num- 
ber of points is approximately 120, but only 14 points are 
visible in the top view, and they are the only interesting 
points because they form the corners of the polyhedra (or 
polygons in this example) over the parameter space whose 
interior all have the same optimal policy (THEOREM 3). 

After computing the set of points, the augmented MDP 
for each of those points is created and the optimal policy 
for each of those augmented MDPs is computed by SOLVE, 
which can use standard dynamic programming algorithms. 
The value of a policy and a point in parameter space is

\[ V^{\bar{x}_{\pi_2}}_{\pi_1}(s_0^1) = P(E_1^1|\pi_1)\, P(E_1^2|\pi_2)\, c_1 + P(E_2^1|\pi_1)\, P(E_2^2|\pi_2)\, c_2 + V_{\pi_1}(s_0^1). \]

For a given $\pi_1$, the value function is a plane over the parameter
space. The plane for the new optimal policies will either
be equivalent (different policy but same value) or equal to 
a plane already in the coverage set, or it will be better than 
every other plane in the coverage set at this point in param- 
eter space. If it is the latter case, this new plane is added 
to the coverage set. If a complete iteration does not find 
any new planes, then the loop terminates and the current 
coverage set is returned. 
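The loop of COVERAGE-SET can be sketched for a one-dimensional parameter space, where each policy's value is a line over [0, 1]. This is an illustration under strong assumptions, not the authors' implementation: policies are enumerated up front as (local value, event probability) pairs, SOLVE is replaced by a maximum over that list, and INTERSECT is a pairwise line intersection. All names and numbers are hypothetical.

```python
def solve(policies, x, c):
    """Stand-in for SOLVE(AUGMENT(...)): index of the best policy at point x."""
    return max(range(len(policies)),
               key=lambda i: policies[i][0] + policies[i][1] * x * c)


def intersections(policies, known, c):
    """Boundary points plus crossings of known value lines inside [0, 1]."""
    pts = {0.0, 1.0}
    known = sorted(known)
    for n, a in enumerate(known):
        for b in known[n + 1:]:
            (va, pa), (vb, pb) = policies[a], policies[b]
            slope = (pa - pb) * c
            if slope != 0:
                x = (vb - va) / slope
                if 0.0 <= x <= 1.0:
                    pts.add(x)
    return pts


def coverage_set(policies, c):
    """Repeat until no point yields a policy that is not already known."""
    planes = set()
    while True:
        new = {solve(policies, x, c)
               for x in intersections(policies, planes, c)} - planes
        if not new:
            return planes
        planes |= new


# Value lines: 3, 10x, 2 + 4x and 1 + x for a joint reward of 10.
policies = [(3.0, 0.0), (0.0, 1.0), (2.0, 0.4), (1.0, 0.1)]
cover = coverage_set(policies, 10.0)
```

In this instance the boundary points yield policies 0 and 1, their crossing at x = 0.3 reveals policy 2, and the next pass finds nothing new; policy 3 lies below the upper envelope everywhere and is never returned.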

3.3 Selecting the Optimal Joint Policy 

Given the optimal coverage set for agent i, finding the optimal
joint policy is straightforward. From Proposition 1, an optimal
joint policy includes an optimal policy for agent i given some
augmented MDP. Since the optimal coverage set includes all
of the optimal policies, it includes one that is a part of an
optimal joint policy. To find the optimal joint policy the al- 
gorithm finds the corresponding optimal policy for agent j 
for every policy in the optimal coverage set. It does this by 
creating an augmented MDP for agent j for each policy in 
the optimal coverage set and then finding the optimal policy 
of that MDP. The optimal joint policy is the resulting joint 
policy with the highest global value. 

The function GV returns the global value as defined in 
Definition 11. 
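The selection step can be sketched as follows. As in a toy setting, each policy is summarized by a hypothetical (local value, event-probability vector) pair, agent 2's augmented MDP is solved by enumeration rather than dynamic programming, and the coverage set is passed in as a list of indices; these are assumptions for illustration only.

```python
def joint_value(p1, p2, c):
    """JV for two event-probability vectors and rewards c."""
    return sum(a * b * ck for a, b, ck in zip(p1, p2, c))


def select_joint_policy(pols1, pols2, optset, c):
    """Pair each agent-1 policy in optset with agent 2's best response."""
    best, best_value = None, float("-inf")
    for i in optset:
        v1, p1 = pols1[i]
        # Best response of agent 2 = optimal policy of its augmented MDP.
        j = max(range(len(pols2)),
                key=lambda k: pols2[k][0] + joint_value(p1, pols2[k][1], c))
        gv = v1 + pols2[j][0] + joint_value(p1, pols2[j][1], c)
        if gv > best_value:
            best, best_value = (i, j), gv
    return best, best_value


# Policy 0 works alone for a local value of 3; policy 1 does the joint task.
pols = [(3.0, [0.0]), (0.0, [1.0])]
best, value = select_joint_policy(pols, pols, optset=[0, 1], c=[10.0])
```

Because every policy in the coverage set is tried, the selection considers both equilibria of this instance and keeps the one with the higher global value.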

Theorem 4. The coverage set algorithm is complete and 
optimal. 

Proof (sketch). To prove that the coverage set algorithm
is complete and optimal, we must prove three claims: the 
algorithm terminates, it finds the optimal coverage set, and 
it returns the optimal joint policy. We avoid discussion of 
multiple equivalent policies for clarity. 

Termination - There are four loops in this algorithm. The 
first is over a set of policies, which is finite. The second 
is over the number of dimensions in the parameter space, 
which is finite. The third is looping until no new policies 
are found. The new policies are added to the coverage set 
each iteration, so a particular policy could never be a new 
policy twice. Since the set of policies is finite, this loop will 
terminate. The fourth loop is over a set of points, which are 
a subset of intersection points of a finite set of planes, which 
is finite. All of the loops in this algorithm halt, therefore 
the algorithm halts. 

Optimal coverage set is found - All the planes/policies 
in the returned set are derived by solving the corresponding 
MDP using dynamic programming and are therefore opti- 
mal. All the relevant point intersections between the hyper- 
planes are found. This set of points divides the parameter 
space into a set of polyhedra. From Theorem 3, if no new
optimal policies are found from those points, then the set of
optimal policies is the optimal coverage set.

The optimal joint policy is returned - The set of joint
policies created by taking an optimal coverage set and finding
the corresponding optimal policy for the other agent
includes all Nash equilibria. From Proposition 1, the best
of those equilibria is the joint optimal policy. □

4. EXPERIMENTAL RESULTS 

We have implemented a version of this algorithm that 
works on problems with a 2-dimensional parameter space, 
and we have tested it on a number of randomly generated 
problems. The test instances were modeled after the Mars 
rover example used throughout this paper. There are two 
rovers, each with an ordered set of sites to visit and collect 
data. The state for each rover is composed of its current
task and the time left in the day. In each state, the available
actions are to skip the current site or to collect data at the
current site. If the rover chooses to skip, it moves on to the
next site without using any time. The time needed to collect
data follows a probability distribution, but collection always
succeeds unless the time in the day runs out.
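
A single rover's local MDP of this form can be solved by backward induction over (site, time left). The sketch below is illustrative only: the rewards, the discrete duration distribution, and the function name are hypothetical, and it solves one rover's local problem without the joint reward terms of an augmented MDP. It also assumes that a collection attempt that would exceed the remaining time ends the day with no reward.

```python
from functools import lru_cache

REWARDS = [0.6, 0.3, 0.9]               # hypothetical reward per site
DURATIONS = {2: 0.25, 3: 0.5, 4: 0.25}  # hypothetical P(collection takes d hours)

@lru_cache(maxsize=None)
def solve(site, time_left):
    """Return (optimal value, optimal action) for the state (site, time_left)."""
    if site == len(REWARDS) or time_left <= 0:
        return 0.0, None  # no sites left or the day is over
    # Skipping moves to the next site and uses no time.
    skip_value = solve(site + 1, time_left)[0]
    # Collecting succeeds only for durations that fit in the remaining time;
    # otherwise the day ends with no reward (assumption of this sketch).
    collect_value = 0.0
    for duration, prob in DURATIONS.items():
        if duration <= time_left:
            future = solve(site + 1, time_left - duration)[0]
            collect_value += prob * (REWARDS[site] + future)
    if collect_value > skip_value:
        return collect_value, "collect"
    return skip_value, "skip"
```

Memoizing on (site, time_left) means each state is evaluated once, which mirrors the size of the state space (sites times hours) described in the text.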

The problem instances generated had a total time of 15
hours and 6 sites, for a total of 90 states for each rover. The
reward for skipping a task was 0, and the reward for collecting
at a site was randomly chosen from a uniform distribution between
0.1 and 1.0. The time to collect followed a Gaussian distribution
with a mean between 4.0 and 6.0 and a variance of 40% of the
mean. Two of the sites (3 and 4) had a reward that was
superadditive in the global value; $c_1$ and $c_2$ were randomly
chosen from a uniform distribution between 0.05 and 0.5.

Figure 3: Distribution over the size of the optimal
coverage set. The rightmost bar is > 76.

Figure 4: Number of value iterations by the size of
the optimal coverage set.

Data was collected from 4208 random problems. About
11% of the underlying MDPs were trivial, that is, the optimal
coverage set included only one policy. At the other
extreme, about 7% of the underlying MDPs were hard and
had a large number of policies in the optimal coverage set.
A problem is composed of two underlying MDPs, and a problem
is only as hard as the easier of the two, which means that
about 21% of the problems were trivial and less than 1% were
hard. In our tests, we considered an optimal coverage set of
size > 76 to be hard. The average size (not including the hard
problems) was 13.3. The full distribution can be seen in Figure 3.
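
The problem-level percentages follow from the per-MDP percentages if, as an assumption not stated in the text, the two underlying MDPs of a problem are trivial or hard independently of one another. A problem is then trivial when at least one of its MDPs is trivial, and hard only when both are hard:

```latex
\begin{align*}
P(\text{problem trivial}) &= 1 - (1 - 0.11)^2 = 1 - 0.7921 = 0.2079 \approx 21\%, \\
P(\text{problem hard})    &= 0.07^2 = 0.0049 < 1\%.
\end{align*}
```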

Figure 4 shows how the algorithm scales with the
size of the optimal coverage set. The average number of
value iterations was 1963. The number of reachable states
in these problems ranged from 70 to 75. Solving them by
brute-force policy search would require at least $2^{70}$ policy
evaluations. While policy evaluation is usually cheaper than
value iteration, that many evaluations is clearly infeasible
for these problems.

To collect this data, we used a brute force intersection al- 
gorithm that found all of the intersection points and weeded 
out only those that were out of bounds. It did not check to 
see whether the points were below another already known 
plane. We also cached the results of each run of value it- 
eration and never ran it twice on the same point. We are 
working on a more efficient implementation of the function 
INTERSECT that will not need to intersect all planes to 
find the interesting points. This will lead to both a faster 
INTERSECT function, and fewer runs of value iteration. 
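
A brute-force intersection of this kind can be sketched for a 2-dimensional parameter space. The plane representation z = a*x + b*y + c, the unit-square bounds, and the names `intersect` and `solve_cached` are assumptions for illustration, not the implementation used in the experiments. As in the text, only out-of-bounds points are discarded; points lying below another known plane are kept.

```python
from itertools import combinations

def intersect(planes, lo=0.0, hi=1.0, eps=1e-9):
    """Intersection points of every triple of planes z = a*x + b*y + c."""
    points = set()
    for (a1, b1, c1), (a2, b2, c2), (a3, b3, c3) in combinations(planes, 3):
        # Equating plane 1 with planes 2 and 3 gives a 2x2 system in (x, y).
        m11, m12, r1 = a1 - a2, b1 - b2, c2 - c1
        m21, m22, r2 = a1 - a3, b1 - b3, c3 - c1
        det = m11 * m22 - m12 * m21
        if abs(det) < eps:
            continue  # the three planes share no unique point
        x = (r1 * m22 - m12 * r2) / det  # Cramer's rule
        y = (m11 * r2 - r1 * m21) / det
        # Weed out only the points that are out of bounds.
        if lo - eps <= x <= hi + eps and lo - eps <= y <= hi + eps:
            points.add((round(x, 9), round(y, 9)))
    return points

_cache = {}

def solve_cached(point, solve):
    """Run the (hypothetical) solver at most once per point."""
    if point not in _cache:
        _cache[point] = solve(point)
    return _cache[point]
```

With n planes this examines every triple, which is the cost a more selective INTERSECT would avoid.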

5. CONCLUSION 

The framework of decentralized MDPs has been proposed 
to model cooperative multi-agent systems in which agents 
receive only partial information about the world. Comput- 
ing the optimal solution to the general class is NEXP-hard, 
and the only known algorithm is brute force policy search. 
We have identified an interesting subset of problems called 
transition-independent DEC-MDPs, and designed and im- 
plemented an algorithm that returns the optimal joint policy 
for these problems. 

Besides being the first optimal algorithm to solve any non- 
trivial class of DEC-MDPs, the new algorithm can help es- 
tablish a baseline for evaluating fast approximate solutions. 
Using the exact algorithm, other experiments have shown 
that a simple hill-climbing algorithm with random restarts 
performs quite well. In many cases, it finds the optimal joint 
policy very quickly, which we would not have known with- 
out identifying the optimal solution using the coverage set 
algorithm. 

The new algorithm performed well on randomly gener- 
ated problems within a simple experimental testbed. A 
more comprehensive evaluation is the focus of future work. 
This will include a formal analysis of the algorithm’s running 
time, and testing the algorithm with more complex problem 
instances. This algorithm is also naturally an anytime algo- 
rithm, because the COVERAGE-SET function could termi- 
nate at any time and return a subset of the optimal coverage 
set. We will explore this more in future work. We also plan 
to explore the range of problems that can be modeled within 
this framework. For example, one problem that seems to fit 
the framework involves a pair of agents acting with some 
shared resources. If together they use more than the to- 
tal available amount of one of the resources, they incur a 
penalty representing the cost of acquiring additional units 
of that resource. 

Finally, the optimal coverage set is an efficient represen- 
tation of the changes in the environment that would cause 
an agent to adopt a different policy. This information could 
be extremely useful in deciding when and what to commu- 
nicate or negotiate with the other agents. In future work, 
we will explore ways to use this representation in order to 
develop communication protocols that are sensitive to the 
cost of communication. 